\(~\)
Primary material:
These slides
\(~\)
Secondary material:
\(~\)
See also References and further reading (last slide) for additional material.
\(~\)
\(~\)
\(~\)
\(~\)
Neural networks (NNs) first became widely popular in the 1990s.
This brought a shift from statistics to computer science and machine learning, since NNs are highly parameterized models.
Statisticians were skeptical: ``It’s just a nonlinear model’’.
After the first hype, NNs were pushed aside by boosting and support vector machines.
Revival since around 2010: a consequence of improved computing resources, some methodological innovations, and successful applications to image and video classification, and to speech and text processing.
\(~\)
\(~\)
\(~\)
\(~\)
So we first need to understand: what is a neural network?
Neuron and myelinated axon, with signal flow from inputs at dendrites to outputs at axon terminals. Image credits: By Egm4313.s12 (Prof. Loc Vu-Quoc) https://commons.wikimedia.org/w/index.php?curid=72816083
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
According to Chollet and Allaire (2018) (page 19):
Machine learning isn’t mathematics or physics, where major advancements can be done with a pen and a piece of paper. It’s an engineering science.
\(~\)
\(~\)
\(~\)
Recall from Module 3 the bodyfat dataset, which contains the following variables:

- bodyfat: % of body fat
- age: age of the person
- weight: body weight
- height: body height
- neck: neck thickness
- bmi: body mass index
- abdomen: circumference of abdomen
- hip: circumference of hip

We will now model bodyfat as the response, using all other variables as covariates - this gives us \(p=7\) covariates.
Let \(n\) be the number of observations in the training set, here \(n=243\).
(from Module 3)
\(~\)
We assume \[ Y_i=\beta_0 + \beta_1 x_{i1}+\beta_2 x_{i2}+\cdots + \beta_p x_{ip}+\varepsilon_i={\boldsymbol x}_i^T{\boldsymbol\beta}+\varepsilon_i \ , \]
for \(i=1,\ldots,n\), where \(x_{ij}\) is the value of the \(j\)th predictor for the \(i\)th observation, and \({\boldsymbol\beta}^\top = (\beta_0,\beta_1,\ldots,\beta_p)\) are the regression coefficients.
\(~\)
We used the compact matrix notation for all observations \(i=1,\ldots,n\) together: \[{\boldsymbol Y}={\boldsymbol {X}} \boldsymbol{\beta}+{\boldsymbol{\varepsilon}} \ .\]
Assumptions:
The classical normal linear regression model is obtained if additionally \({\boldsymbol\varepsilon}\sim N_n({\boldsymbol 0},\sigma^2 {\boldsymbol I})\).
\(~\)
How can our statistical model be represented as a network?
\(~\)
We need:
\(~\)
\(~\)
\(~\)
\(~\)
\[\begin{equation*} Y_i=\beta_0 + \beta_1 x_{i1}+\beta_2 x_{i2}+\cdots + \beta_p x_{ip}+\varepsilon_i \ . \end{equation*}\]
\(~\)
\(~\)
In the statistics world
we would have written \(\hat{y}_1({\boldsymbol x}_i)\) to specify that we are estimating a predicted value of the response for the given covariate value.
we would have called the \(w\)s \(\hat{\beta}\)s instead.
\(~\)
Remember: The estimator \(\hat{\boldsymbol \beta}\) is found by minimizing the RSS for a multiple linear regression model: \[ \begin{aligned} \text{RSS} &=\sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \hat \beta_0 - \hat \beta_1 x_{i1} - \hat \beta_2 x_{i2} -\cdots-\hat \beta_p x_{ip} )^2 \\ &= \sum_{i=1}^n (y_i-{\boldsymbol x}_i^T \hat{\boldsymbol \beta})^2=({\boldsymbol Y}-{\boldsymbol X}\hat{\boldsymbol{\beta}})^T({\boldsymbol Y}-{\boldsymbol X}\hat{\boldsymbol{\beta}}) \ .\end{aligned} \] Solution: \[ \hat{\boldsymbol\beta}=({\boldsymbol X}^T{\boldsymbol X})^{-1} {\boldsymbol X}^T {\boldsymbol Y} \ .\]
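As a sanity check, the closed-form solution can be computed directly in base R and compared with lm(); a minimal sketch on simulated data (variable names are illustrative):

```r
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

# Design matrix with an intercept column
X <- cbind(1, x1, x2)

# Closed-form OLS estimate: solve (X^T X) beta = X^T y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# lm() gives the same estimates
fit <- lm(y ~ x1 + x2)
max(abs(beta_hat - coef(fit)))  # numerically zero
```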
We now translate from the statistical into the neural networks world:
\(~\)
\(~\)
\(~\)
\(~\)
(https://github.com/SoojungHong/MachineLearning/wiki/Gradient-Descent)
\(~\)
Here we compare lm and nnet:
\(~\)
Linear regression vs. neural networks: an example.
\(~\)
fit = lm(bodyfat ~ age + weight + height + bmi + neck + abdomen + hip,
data = d.bodyfat)
fitnnet = nnet(bodyfat ~ age + weight + height + bmi + neck + abdomen +
hip, data = d.bodyfat, linout = TRUE, size = 0, skip = TRUE, maxit = 1000,
entropy = FALSE)
## # weights: 8
## initial value 1155864.852028
## iter 10 value 4471.081708
## final value 4415.453729
## converged
cbind(fitnnet$wts, fit$coefficients)
## [,1] [,2]
## (Intercept) -9.748903e+01 -9.748903e+01
## age -9.607669e-04 -9.607669e-04
## weight -6.292821e-01 -6.292820e-01
## height 3.974884e-01 3.974884e-01
## bmi 1.785330e+00 1.785330e+00
## neck -4.945725e-01 -4.945725e-01
## abdomen 8.945189e-01 8.945189e-01
## hip -1.255549e-01 -1.255549e-01
Aim is to predict if a person has diabetes. The data stem from a population of women of Pima Indian heritage in the US, available in the R MASS package. The following information is available for each woman:
\(~\)
- diabetes: 0 = not present, 1 = present
- npreg: number of pregnancies
- glu: plasma glucose concentration in an oral glucose tolerance test
- bp: diastolic blood pressure (mmHg)
- skin: triceps skin fold thickness (mm)
- bmi: body mass index (weight in kg/(height in m)\(^2\))
- ped: diabetes pedigree function
- age: age in years

\(~\)
\(~\)
\(~\)
(Maximum likelihood)
\[ \ln(L(\boldsymbol{\beta}))=l(\boldsymbol{\beta}) =\sum_{i=1}^n \Big ( y_i \ln p_i + (1-y_i) \ln(1 - p_i )\Big ) \ .\]
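This log-likelihood can be checked numerically against glm(): evaluated at the fitted probabilities it should equal logLik() of the fit. A base-R sketch on simulated data (names illustrative):

```r
set.seed(1)
n <- 200
x <- rnorm(n)
p_true <- 1 / (1 + exp(-(0.5 + 1.2 * x)))
y <- rbinom(n, 1, p_true)

fit <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)

# Binomial log-likelihood evaluated at the fitted probabilities
ll <- sum(y * log(p_hat) + (1 - y) * log(1 - p_hat))

c(by_hand = ll, from_glm = as.numeric(logLik(fit)))  # agree
```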
\(~\)
\(~\)
\[ \hat{y}_1({\boldsymbol x}_i)= \sigma({\boldsymbol x}_i) = \frac{1}{1+\exp(-(w_0+w_1 x_{i1}+\cdots + w_r x_{ir}))} \in (0,1) \ . \]
\(~\)
\(~\)
\(~\)
fitlogist = glm(diabetes ~ npreg + glu + bp + skin + bmi + ped + age,
data = train, family = binomial(link = "logit"))
summary(fitlogist)
##
## Call:
## glm(formula = diabetes ~ npreg + glu + bp + skin + bmi + ped +
## age, family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9830 -0.6773 -0.3681 0.6439 2.3154
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.773062 1.770386 -5.520 3.38e-08 ***
## npreg 0.103183 0.064694 1.595 0.11073
## glu 0.032117 0.006787 4.732 2.22e-06 ***
## bp -0.004768 0.018541 -0.257 0.79707
## skin -0.001917 0.022500 -0.085 0.93211
## bmi 0.083624 0.042827 1.953 0.05087 .
## ped 1.820410 0.665514 2.735 0.00623 **
## age 0.041184 0.022091 1.864 0.06228 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 256.41 on 199 degrees of freedom
## Residual deviance: 178.39 on 192 degrees of freedom
## AIC: 194.39
##
## Number of Fisher Scoring iterations: 5
set.seed(787879)
library(nnet)
fitnnet = nnet(diabetes ~ npreg + glu + bp + skin + bmi + ped + age,
data = train, linout = FALSE, size = 0, skip = TRUE, maxit = 1000,
entropy = TRUE, Wts = fitlogist$coefficients + rnorm(8, 0, 0.1))
## # weights: 8
## initial value 213.575955
## iter 10 value 89.511044
## final value 89.195333
## converged
# entropy=TRUE because default is least squares
cbind(fitnnet$wts, fitlogist$coefficients)
## [,1] [,2]
## (Intercept) -9.773046277 -9.773061533
## npreg 0.103183171 0.103183427
## glu 0.032116832 0.032116823
## bp -0.004767678 -0.004767542
## skin -0.001917105 -0.001916632
## bmi 0.083624151 0.083623912
## ped 1.820397792 1.820410367
## age 0.041183744 0.041183529
\(~\)
By setting entropy=TRUE we minimize the cross-entropy loss.
plotnet(fitnnet)
But, there may also exist local minima.
\(~\)
set.seed(123)
fitnnet = nnet(diabetes ~ npreg + glu + bp + skin + bmi + ped + age,
data = train, linout = FALSE, size = 0, skip = TRUE, maxit = 10000,
entropy = TRUE, Wts = fitlogist$coefficients + rnorm(8, 0, 1))
## # weights: 8
## initial value 24315.298582
## final value 12526.062906
## converged
cbind(fitnnet$wts, fitlogist$coefficients)
## [,1] [,2]
## (Intercept) -36.733537 -9.773061533
## npreg -77.126994 0.103183427
## glu -2984.409175 0.032116823
## bp -1835.934259 -0.004767542
## skin -718.072629 -0.001916632
## bmi -818.561311 0.083623912
## ped -8.687473 1.820410367
## age -773.023878 0.041183529
Why can NN and logistic regression lead to such different results?
\(~\)
The iris flower data set was introduced by the British statistician and biologist Ronald Fisher in 1936.
\(~\)
The four covariates are Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.\(~\)
The aim is to predict the species of an iris plant.
\(~\)
We only briefly mentioned multiclass regression in module 4.
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
First select a training sample
\(~\)
library(nnet)
set.seed(123)
train = sample(1:150, 50)
iris_train = ird[train, ]
iris_test = ird[-train, ]
\(~\)
Then fit the nnet() (by default using the softmax activation function)
set.seed(1234)
iris.nnet <- nnet(species ~ ., data = ird, subset = train, size = 0,
skip = TRUE, maxit = 100)
## # weights: 15
## initial value 105.595764
## iter 10 value 1.050064
## iter 20 value 0.018814
## iter 30 value 0.003937
## iter 40 value 0.002062
## iter 50 value 0.001460
## iter 60 value 0.000150
## iter 70 value 0.000125
## iter 80 value 0.000110
## final value 0.000096
## converged
How many weights (parameters) have been estimated?
What does the graph look like?
\(~\)
summary(iris.nnet)
## a 4-0-3 network with 15 weights
## options were - skip-layer connections softmax modelling
## b->o1 i1->o1 i2->o1 i3->o1 i4->o1
## 36.79 9.69 1.26 -14.33 -13.24
## b->o2 i1->o2 i2->o2 i3->o2 i4->o2
## 5.16 10.10 21.33 -25.47 -12.48
## b->o3 i1->o3 i2->o3 i3->o3 i4->o3
## -42.03 -20.25 -23.12 40.80 25.96
Multinomial regression is also done with nnet, but using a wrapper multinom (we use default settings, so results are not necessarily the same as above).
library(caret)
fit = multinom(species ~ -1 + ., family = multinomial, data = iris_train)
## # weights: 15 (8 variable)
## initial value 54.930614
## iter 10 value 4.353139
## iter 20 value 0.139411
## iter 30 value 0.065218
## iter 40 value 0.056419
## iter 50 value 0.045548
## iter 60 value 0.020867
## iter 70 value 0.016116
## iter 80 value 0.012952
## iter 90 value 0.012787
## iter 100 value 0.009090
## final value 0.009090
## stopped after 100 iterations
coef(fit)
## Sepal.L. Sepal.W. Petal.L. Petal.W.
## s -5.91976 21.30247 -12.52073 -2.774511
## v -39.93182 -28.07162 53.73326 41.264390
Problem: multinom() seems to fit an intercept plus an offset node (B), thus we have to remove the intercept manually (by saying -1 in the above formula).
\(~\)
\(~\)
\(~\)
\(~\) But:
These are only linear models (linear boundaries).
Parameters (weights) found using gradient descent algorithms where the learning rate (step length) must be set.
Connections are only forward in the network; there are no feedback connections that send the output of the model back into the network.
Examples: Linear, logistic and multinomial regression with or without any hidden layers (between the input and output layers).
We may have no hidden layer, one (to be studied next), or many.
Adding hidden layers with non-linear activation functions between the input and output layer will make nonlinear statistical models.
The number of hidden layers is called the depth of the network, and the number of nodes in a layer is called the width of the layer.
\(~\)
\(~\)
\(~\)
The universal approximation theorem says that a feedforward network with a linear output layer and at least one hidden layer with sufficiently many hidden units
can approximate any (Borel measurable) function from one finite-dimensional space (our input layer) to another (our output layer) with any desired non-zero amount of error.
In particular, the universal approximation theorem holds for networks with sigmoid activation functions
in the hidden layer.
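As a toy illustration of the flavour of this result (not a proof), one can approximate a smooth function by a linear combination of shifted sigmoids; below the hidden-layer weights are fixed and only the output weights are fitted by least squares (all numerical choices are illustrative):

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Target function on [0, 1]
x <- seq(0, 1, length.out = 200)
f <- sin(2 * pi * x)

# "Hidden layer": 20 sigmoid units with fixed centres and a fixed slope
centres <- seq(0, 1, length.out = 20)
H <- sapply(centres, function(cc) sigmoid(20 * (x - cc)))

# Fit only the output-layer weights by least squares
fit <- lm(f ~ H)
max(abs(fitted(fit) - f))  # small maximum approximation error
```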
nnet and keras R packages\(~\)
We will use both the rather simple nnet R package by Brian Ripley and the currently very popular keras package for deep learning (the keras package will be presented later).
nnet fits one hidden layer with a sigmoid activation function. The implementation does not use gradient descent, but BFGS via optim.
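The idea of fitting by general-purpose BFGS can be mimicked in miniature: minimize the RSS of a simple linear model with optim(method = "BFGS") and compare to lm() (a sketch on simulated data):

```r
set.seed(42)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

# RSS as a function of the parameter vector (beta0, beta1)
rss <- function(beta) sum((y - beta[1] - beta[2] * x)^2)

# BFGS with a numerically approximated gradient
opt <- optim(c(0, 0), rss, method = "BFGS")

rbind(optim = opt$par, lm = unname(coef(lm(y ~ x))))  # essentially identical
```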
Type ?nnet() into your R-console to see the arguments of nnet().
If the response in formula is a factor, an appropriate classification network is constructed; this has one output and entropy fit if the number of levels is two, and a number of outputs equal to the number of classes and a softmax output stage for more levels.
Objective: To predict the median price of owner-occupied homes in a given Boston suburb in the mid-1970s using 10 input variables.
This data set is both available in the MASS and keras R package.
\(~\)
(load the keras library)
library(keras)
dataset <- dataset_boston_housing()
c(c(train_data, train_targets), c(test_data, test_targets)) %<-% dataset
str(train_targets)
## num [1:404(1d)] 15.2 42.3 50 21.1 17.7 18.5 11.3 15.6 15.6 14.4 ...
head(train_data)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] 1.23247 0.0 8.14 0 0.538 6.142 91.7 3.9769 4 307 21.0 396.90
## [2,] 0.02177 82.5 2.03 0 0.415 7.610 15.7 6.2700 2 348 14.7 395.38
## [3,] 4.89822 0.0 18.10 0 0.631 4.970 100.0 1.3325 24 666 20.2 375.52
## [4,] 0.03961 0.0 5.19 0 0.515 6.037 34.5 5.9853 5 224 20.2 396.90
## [5,] 3.69311 0.0 18.10 0 0.713 6.376 88.4 2.5671 24 666 20.2 391.43
## [6,] 0.28392 0.0 7.38 0 0.493 5.708 74.3 4.7211 5 287 19.6 391.13
## [,13]
## [1,] 18.72
## [2,] 3.11
## [3,] 3.26
## [4,] 8.01
## [5,] 14.65
## [6,] 11.74
\(~\)
The column names are missing (we could get them by using the Boston dataset loaded from the MASS library, but they are not relevant here).
\(~\)
org_train = train_data
mean <- apply(train_data, 2, mean)
std <- apply(train_data, 2, sd)
train_data <- scale(train_data, center = mean, scale = std)
test_data <- scale(test_data, center = mean, scale = std)
\(~\)
Just checking out one hidden layer with 5 units to get going.
library(nnet)
fit5 <- nnet(train_targets ~ ., data = train_data, size = 5, linout = TRUE,
maxit = 1000, trace = F)
pred = predict(fit5, newdata = test_data, type = "raw")
sqrt(mean((pred[, 1] - test_targets)^2))
## [1] 6.167934
mean(abs(pred[, 1] - test_targets))
## [1] 4.271899
\(~\)
We now focus on the different elements of neural networks.
\(~\)
\(~\)
These choices have been guided by solutions in statistics (multiple linear regression, logistic regression, multiclass regression)
\(~\)
- linear: for continuous outcomes (regression problems)
- sigmoid: for binary outcomes (two-class classification problems)
- softmax: for multinomial/categorical outcomes (multi-class classification problems)
\(~\)
Remark: it is important that the output activation is matched with an appropriate loss function (see 4).
\(~\)
Network architecture contains three components:
\(~\)
- Width: How many nodes are in each layer of the network?
- Depth: How deep is the network (how many hidden layers)?
- Connectivity: How are the nodes connected to each other?
\(~\)
This depends on the problem, and here experience is important.
\(~\)
However, the recent practice is to choose a large enough network and rely on regularization to avoid overfitting, rather than tuning the exact size.
\(~\)
\(~\)
This simplifies the choice of network architecture to choosing a large enough network.
\(~\)
The choice of the loss function is closely related to the output layer activation function.
To sum up, the popular problem types, output activation and loss functions are:
| Problem | Output activation | Loss function |
|---|---|---|
| Regression | linear | `mse` |
| Classification (C=2) | sigmoid | `binary_crossentropy` |
| Classification (C>2) | softmax | `categorical_crossentropy` |
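These three loss functions are easy to evaluate by hand; a base-R sketch on toy numbers (all values illustrative):

```r
# Regression: mean squared error
y <- c(1.0, 2.0, 3.0)
y_hat <- c(1.1, 1.9, 3.2)
mse <- mean((y - y_hat)^2)

# Two-class classification: binary cross-entropy
yb <- c(1, 0, 1)
p_hat <- c(0.9, 0.2, 0.7)
bce <- -mean(yb * log(p_hat) + (1 - yb) * log(1 - p_hat))

# Multi-class classification: categorical cross-entropy
# (rows: one-hot labels Y and predicted class probabilities P)
Y <- rbind(c(1, 0, 0), c(0, 1, 0))
P <- rbind(c(0.8, 0.1, 0.1), c(0.2, 0.6, 0.2))
cce <- -mean(rowSums(Y * log(P)))

c(mse = mse, bce = bce, cce = cce)
```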
Due to how estimation is done (see below), the loss functions chosen “need” to be differentiable with respect to the parameters.
\(~\)
\(~\)
Let the unknown parameters be denoted \({\boldsymbol \theta}\) (what we have previously denoted as \(\alpha\)s and \(\beta\)s), and let \(J({\boldsymbol \theta})\) be the loss function to be minimized.
\(~\)
Gradient descent
Mini-batch stochastic gradient descent (SGD) and true SGD
Backpropagation
\(~\)
Remember:
Given the gradient \(\nabla J({\boldsymbol \theta}^{(t)})\) of the loss function evaluated at the current estimate \({\boldsymbol \theta}^{(t)}\), then the algorithm estimates the parameter at the next step as:
\[{\boldsymbol \theta}^{(t+1)}={\boldsymbol \theta}^{(t)} - \lambda \nabla_{\boldsymbol \theta} J({\boldsymbol \theta}^{(t)}) \ , \]
where \(\lambda\) is the learning rate (usually a small value). In keras the default learning rate is \(0.01\).
Q: Why are we moving in the direction of the negative of the gradient? Why not the positive?
A:
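A minimal gradient-descent loop for the one-parameter toy loss \(J(\theta)=(\theta-3)^2\), whose gradient is \(2(\theta-3)\) (learning rate chosen for illustration):

```r
# Gradient of J(theta) = (theta - 3)^2
grad <- function(theta) 2 * (theta - 3)

theta <- 0      # starting value
lambda <- 0.1   # learning rate
for (t in 1:100) {
  # Move against the gradient: theta <- theta - lambda * grad(theta)
  theta <- theta - lambda * grad(theta)
}
theta  # converges to the minimizer 3
```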
In (batch) gradient descent, the loss function is computed as a mean over all training examples. \[ J({\boldsymbol \theta})=\frac{1}{n}\sum_{i=1}^n J({\boldsymbol x}_i, y_i) \ . \]
The gradient is an average over many individual gradients from the training example. You can think of this as an estimator for an expectation.
\[ \nabla_{\boldsymbol \theta} J({\boldsymbol \theta})=\frac{1}{n}\sum_{i=1}^n \nabla_{\boldsymbol \theta} J({\boldsymbol x}_i, y_i) \ . \]
Advantages:
The optimizer will converge much faster if they can rapidly compute approximate estimates of the gradient, instead of slowly computing the exact gradient (using all training data).
Mini-batches may be processed in parallel, and the batch size is often a power of 2 (32 or 256).
Small batches also bring in a regularization effect, maybe due to the variability they bring to the optimization process.
In the third video (on backpropagation) from 3Blue1Brown there is a nice example of one trajectory from gradient descent and one from SGD (10:10 minutes into the video): https://www.youtube.com/watch?v=Ilg3gGewQ5U&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=3
Mini-batch stochastic gradient descent
Special case: true SGD involves only a single randomly chosen training example per iteration (mini-batch size 1). \(\rightarrow\) Mini-batch SGD is a compromise between true SGD (one sample per iteration) and full gradient descent (full dataset per iteration).
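Mini-batch SGD for linear regression can be sketched in a few lines of base R (batch size, learning rate, and number of epochs are illustrative choices):

```r
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
X <- cbind(1, x)

theta <- c(0, 0)   # parameters (intercept, slope)
lambda <- 0.05     # learning rate
batch_size <- 32

for (epoch in 1:50) {
  idx <- sample(n)  # shuffle the data each epoch
  for (b in seq(1, n, by = batch_size)) {
    rows <- idx[b:min(b + batch_size - 1, n)]
    Xb <- X[rows, , drop = FALSE]
    yb <- y[rows]
    # Gradient of the mean squared loss on this mini-batch only
    g <- -2 * t(Xb) %*% (yb - Xb %*% theta) / length(rows)
    theta <- theta - lambda * g
  }
}
drop(theta)  # close to the true values (1, 2)
```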
\(~\)
More background:
\(~\)
\(~\)
The learning rate \(\lambda\) is difficult to set. Some ideas for adaptive learning rates (Goodfellow et al, Chapter 8.5).
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\[ \tilde{J}({\boldsymbol w})= \frac{\alpha}{2}{{\boldsymbol w}^\top{\boldsymbol w}} + J({\boldsymbol w}) \ .\]
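Taking the gradient of the penalized loss shows why this penalty is called weight decay: each gradient-descent update first shrinks the weights multiplicatively, then takes the usual step, \[ \nabla_{\boldsymbol w} \tilde{J}({\boldsymbol w}) = \alpha {\boldsymbol w} + \nabla_{\boldsymbol w} J({\boldsymbol w}) \quad \Rightarrow \quad {\boldsymbol w}^{(t+1)} = (1-\lambda \alpha)\, {\boldsymbol w}^{(t)} - \lambda \nabla_{\boldsymbol w} J({\boldsymbol w}^{(t)}) \ . \]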
Based on Goodfellow, Bengio, and Courville (2016), Section 7.8
\(~\)
Based on Goodfellow, Bengio, and Courville (2016), Section 7.12, and Chollet and Allaire (2018) 4.4.3
\(~\)
Dropout was developed by Geoff Hinton and his students.
\(~\)
During training: randomly drop out (set to zero) some outputs in a given layer at each iteration. Drop-out rates are typically chosen between 0.2 and 0.5.
During test: no dropout, but scale down the layer output values by the keep probability (1 minus the drop-out rate), since more units are now active than during training.
Alternatively, the drop-out and scaling (now upscaling) can be done during training.
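The inverted-dropout variant (dropping and upscaling during training, as in the last point) can be sketched in base R (rate and layer size are illustrative):

```r
set.seed(1)
h <- rnorm(10)   # outputs of some hidden layer (illustrative)
rate <- 0.5      # drop-out rate

# Training: zero each unit independently with probability `rate`,
# then upscale the survivors by 1/(1 - rate), so that the expected
# activation matches the test-time (no-dropout) activation
mask <- rbinom(length(h), 1, 1 - rate)
h_train <- h * mask / (1 - rate)

# Test time: with this variant, use all units unchanged
h_test <- h
```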
One way to look at dropout is along the lines of what we did in Module 8, where we used bootstrapping to produce many data sets, fitted a model to each, and averaged the results (bagging). Randomly dropping outputs in a layer can be seen as mimicking bagging, in an efficient way.
\(~\)
Hyperparameters: The network architecture, the number of batches to run before terminating the optimization, the drop-out rate.
\(~\)
Ways to avoid overfitting:
Reduce network size.
Collect more observations.
Regularization.
\(~\)
It is important that the hyperparameters are chosen on a validation set or by cross-validation.
However, a “popular” term is validation-set overfitting: using the validation set to decide so many hyperparameters that you may effectively overfit the validation set.
Deep Learning is an algorithm which has no theoretical limitations of what it can learn; the more data you give and the more computational time you provide, the better it is.
Geoffrey Hinton (Google)
(based on Chollet and Allaire (2018))
\(~\)
\(~\)
\(~\)
\(~\)
Keras is a high-level neural networks API developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.
More information on the R solution: https://keras.rstudio.com/
Cheat-sheet for R Keras: https://github.com/rstudio/cheatsheets/raw/master/keras.pdf
\(~\)
\(~\)
Objective: classify the digit contained in an image (28 \(\times\) 28 greyscale).
Labels for the training data:
## train_labels
## 0 1 2 3 4 5 6 7 8 9
## 5923 6742 5958 6131 5842 5421 5918 6265 5851 5949
\(~\)
60 000 images for training and 10 000 images for testing.
\(~\)
# Training data
train_images <- mnist$train$x
train_labels <- mnist$train$y
# Test data
test_images <- mnist$test$x
test_labels <- mnist$test$y
org_testlabels <- test_labels
\(~\)
The train_images object is a tensor (a generalization of a matrix) with 3 axes: (samples, height, width).
Using keras_model_sequential() we build a model with a stack of layers. layer_dense() adds densely connected layers on top of the input layer. Each sample contains 28*28 = 784 pixels (= input nodes), and there are 10 classes (= output nodes). Adding a bias term (intercept) is the default for layer_dense().
network <- keras_model_sequential() %>% layer_dense(units = 512, activation = "relu",
input_shape = c(28 * 28)) %>% layer_dense(units = 10, activation = "softmax")
summary(network)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense (Dense) (None, 512) 401920
## ________________________________________________________________________________
## dense_1 (Dense) (None, 10) 5130
## ================================================================================
## Total params: 407,050
## Trainable params: 407,050
## Non-trainable params: 0
## ________________________________________________________________________________
As activation function we use relu for the hidden layer, and softmax for the output layer - since we have 10 classes (where one is correct each time).
\(~\)
We choose:

- the loss function: cross-entropy (categorical_crossentropy),
- the optimizer: RMSprop,
- the metric to monitor during the training phase: accuracy (= percentage correctly classified).
\(~\)
network %>% compile(optimizer = "rmsprop", loss = "categorical_crossentropy",
metrics = c("accuracy"))
\(~\)
The training data is stored in an array of dimension (60000, 28, 28) of type integer, with values in the [0, 255] interval.
The data must be reshaped into the shape the network expects (\(28\cdot 28\) columns). In addition, the grey-scale values are scaled to the [0, 1] interval.
Also, the response must be transformed from the integers 0-9 to a vector of 0s and 1s (dummy variable coding), also known as one-hot encoding.
\(~\)
train_images <- array_reshape(train_images, c(60000, 28 * 28))
train_images <- train_images/255
train_labels <- to_categorical(train_labels)
test_images <- array_reshape(test_images, c(10000, 28 * 28))
test_images <- test_images/255
test_labels <- to_categorical(test_labels)
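For intuition, to_categorical() can be mimicked in base R; the helper below (one_hot, a hypothetical name) builds the same kind of indicator matrix:

```r
# One-hot encode integer labels 0..(num_classes-1) into an
# n x num_classes indicator matrix (label k -> 1 in column k + 1)
one_hot <- function(labels, num_classes = 10) {
  m <- matrix(0, nrow = length(labels), ncol = num_classes)
  m[cbind(seq_along(labels), labels + 1)] <- 1
  m
}

oh <- one_hot(c(3, 0, 9))
oh  # each row has a single 1, in column labels[i] + 1
```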
\(~\)
fitted<-network %>% fit(train_images, train_labels,
epochs = 30, batch_size = 128)
library(ggplot2)
plot(fitted)+ggtitle("Fitted model")
epoch defines how many times you go through your training set (using mini-batches). See here for some documentation.
\(~\)
network %>% evaluate(test_images, test_labels)
testres = network %>% predict_classes(test_images)
# $loss [1] 0.1194063 $acc [1] 0.9827
confusionMatrix(factor(testres), factor(org_testlabels))
Confusion Matrix and Statistics
Reference
Prediction 0 1 2 3 4 5 6 7 8 9
0 971 0 3 0 1 2 5 1 1 1
1 1 1128 2 0 0 0 2 2 2 3
2 1 1 1009 3 3 0 2 7 2 0
3 0 1 2 997 0 11 1 3 4 3
4 1 0 2 0 969 1 4 2 3 7
5 0 1 0 1 0 871 3 0 1 4
6 3 2 2 0 3 4 940 0 1 0
7 1 1 4 2 1 0 0 1002 2 3
8 2 1 7 0 0 2 1 4 953 1
9 0 0 1 7 5 1 0 7 5 987
Overall Statistics
Accuracy : 0.9827
95% CI : (0.9799, 0.9852)
No Information Rate : 0.1135
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9808
Mcnemar's Test P-Value : NA
\(~\)
taken from Chollet and Allaire (2018): https://www.math.ntnu.no/emner/TMA4268/2018v/11NN/11-neural_networks_boston_housing.html
Summary: activation functions (relu, linear, sigmoid) of inner products, sequentially connected layers; nnet and keras in R; piping sequential layers, piping to estimation and then to evaluation (metrics).

Chollet, François, and J. J. Allaire. 2018. Deep Learning with R. Manning Press. https://www.manning.com/books/deep-learning-with-r.
Efron, Bradley, and Trevor Hastie. 2016. Computer Age Statistical Inference - Algorithms, Evidence, and Data Science. Cambridge University Press.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer series in statistics New York.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.